{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "57ca0f53-f780-4cc5-8c37-d754c32a97cc",
   "metadata": {},
   "source": [
    "## Long-Form Document Extraction: Mining Information from SEC 10-K/Q Forms\n",
    "\n",
    "Companies listed on the US stock exchanges are required to file annual and quarterly reports with the SEC. These reports are called 10K (annual) and 10Q (quarterly) filings.\n",
    "10K/Q filings are information dense and contain a lot of information about the company's business, operations, and financials.\n",
    "The documents have a loosely defined structure and the reported metrics and sections may differ based on the company's operations. \n",
    "\n",
    "That said, there are enough commonalities that we may want to extract the information in a standardized format for downstream analysis. e.g. this could be \n",
    "used to extract financial metrics for a company and analysis of key risk factors after every earnings release.\n",
    "\n",
    "Let's take a look at Nvidia's 10-K filing for the year 2024. Here's the SEC link for the [10-K filing](https://www.sec.gov/ix?doc=/Archives/edgar/data/0001045810/000104581025000023/nvda-20250126.htm).\n",
    "As you can see, this is a pretty large document with a lot of information to parse through. \n",
    "\n",
 **Note:** This principle">
    "> **Note:** Deciding which fields generalize across your target documents and which should be optional is an important principle to keep in mind when designing your schema. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53c9d9e4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"600\"\n",
       "            height=\"400\"\n",
       "            src=\"./data/sec_filings/nvda_10k.pdf\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x11b2e3850>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import IFrame\n",
    "\n",
    "IFrame(src=\"./data/sec_filings/nvda_10k.pdf\", width=600, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b2b0485",
   "metadata": {},
   "source": [
    "Let us initialize the LlamaExtract client to extract our information of interest from these 10-K filings. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55bfb70e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "from llama_cloud_services import LlamaExtract\n",
    "\n",
    "\n",
    "# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)\n",
    "load_dotenv(override=True)\n",
    "\n",
    "# Optionally, add your project id/organization id\n",
    "llama_extract = LlamaExtract()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3c60767",
   "metadata": {},
   "source": [
    "### 1. Defining the Extraction Schema\n",
    "\n",
    "To begin with, we'll focus on extracting the following information from the 10K/Q filings which are common across different companies:\n",
    "- *Filing Information*: Date of filing, type of filing, reporting period end date, fiscal year, fiscal quarter\n",
    "- *Company Profile*: Name, ticker, reporting currency, stock exchanges, auditor\n",
    "- *Financial Highlights*: Key metrics to assess the company's financial health - revenue, gross profit, operating income, net income, EPS, EBITDA, free cash flow\n",
    "- *Business/Geographic Segments*: Revenue, operating income, year-over-year growth, outlook for each segment.\n",
    "- *Risk Factors*: Key risks as identified by the company management.\n",
    "- *Management Discussion & Analysis (MD&A)*: Key highlights from management discussion and analysis.\n",
    "\n",
    "\n",
    "#### Using Pydantic Models for Schema Definition\n",
    "\n",
    "We can use JSON to define the schema for the extraction or use Pydantic models to encapsulate the schema. In this example, we'll use Pydantic models for schema definition for a few reasons:\n",
    "- **Extensibility**: They are more flexible, easier to extend and maintain. \n",
    "- **Readability**: Pydantic models are more readable (less verbose) and easier to understand. Nested models in particular are easier to read than deeply nested JSON schemas.\n",
    "- **Type Safety**: By validating against the Pydantic model, your code is guaranteed to be type-safe for use downstream an part of an automated process. e.g. an extracted date field will not suddenly become a numeric type.\n",
    "\n",
    "In this case, imagine that you have a daily ETL pipeline that searches for new 10-K/Q filings and extracts the relevant information for these companies. Once the extraction results are available in LlamaExtract, *it is guaranteed to comply with the schema definition and can be sent to the ETL pipeline without worrying about data type mismatches.*\n",
    "\n",
    "We consider some key design considerations for the schema definition below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "899569db",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Literal, Optional, List\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class FilingInfo(BaseModel):\n",
    "    \"\"\"Basic information about the SEC filing\"\"\"\n",
    "\n",
    "    filing_type: Literal[\"10-K\", \"10-Q\", \"10-K/A\", \"10-Q/A\"] = Field(\n",
    "        description=\"Type of SEC filing\"\n",
    "    )\n",
    "    filing_date: str = Field(description=\"Date when filing was submitted to SEC\")\n",
    "    reporting_period_end: str = Field(description=\"End date of reporting period\")\n",
    "    fiscal_year: int = Field(description=\"Fiscal year\")\n",
    "    fiscal_quarter: int = Field(description=\"Fiscal quarter (if 10-Q)\", ge=1, le=4)\n",
    "\n",
    "\n",
    "class CompanyProfile(BaseModel):\n",
    "    \"\"\"Essential company information\"\"\"\n",
    "\n",
    "    name: str = Field(description=\"Legal name of company\")\n",
    "    ticker: str = Field(description=\"Stock ticker symbol\")\n",
    "    reporting_currency: str = Field(description=\"Currency used in financial statements\")\n",
    "    exchanges: Optional[List[str]] = Field(\n",
    "        None, description=\"Stock exchanges where listed\"\n",
    "    )\n",
    "    auditor: Optional[str] = Field(None, description=\"Company's auditor\")\n",
    "\n",
    "\n",
    "class FinancialHighlights(BaseModel):\n",
    "    \"\"\"Key financial metrics from this reporting period\"\"\"\n",
    "\n",
    "    period_end: str = Field(description=\"End date of reporting period\")\n",
    "    comparison_period_end: Optional[str] = Field(\n",
    "        None, description=\"End date of comparison period (typically prior year/quarter)\"\n",
    "    )\n",
    "    currency: str = Field(description=\"Currency of financial figures\")\n",
    "    unit: str = Field(\n",
    "        description=\"Unit of financial figures (thousands, millions, etc.)\"\n",
    "    )\n",
    "    revenue: float = Field(description=\"Total revenue for period\")\n",
    "    revenue_prior_period: Optional[float] = Field(\n",
    "        None, description=\"Revenue from comparison period\"\n",
    "    )\n",
    "    revenue_growth: float = Field(description=\"Revenue growth percentage\")\n",
    "    gross_profit: Optional[float] = Field(None, description=\"Gross profit\")\n",
    "    gross_margin: float = Field(description=\"Gross margin percentage\")\n",
    "    operating_income: Optional[float] = Field(None, description=\"Operating income\")\n",
    "    operating_margin: Optional[float] = Field(\n",
    "        None, description=\"Operating margin percentage\"\n",
    "    )\n",
    "    net_income: float = Field(description=\"Net income\")\n",
    "    net_margin: Optional[float] = Field(None, description=\"Net margin percentage\")\n",
    "    eps: Optional[float] = Field(None, description=\"Basic earnings per share\")\n",
    "    diluted_eps: Optional[float] = Field(None, description=\"Diluted earnings per share\")\n",
    "    ebitda: Optional[float] = Field(\n",
    "        None,\n",
    "        description=\"EBITDA (Earnings Before Interest, Taxes, Depreciation, Amortization)\",\n",
    "    )\n",
    "    free_cash_flow: Optional[float] = Field(None, description=\"Free cash flow\")\n",
    "\n",
    "\n",
    "class BusinessSegment(BaseModel):\n",
    "    \"\"\"Information about a business segment\"\"\"\n",
    "\n",
    "    name: str = Field(description=\"Segment name\")\n",
    "    description: str = Field(description=\"Segment description\")\n",
    "    revenue: float = Field(None, description=\"Segment revenue\")\n",
    "    revenue_percentage: Optional[float] = Field(\n",
    "        None, description=\"Percentage of total company revenue\"\n",
    "    )\n",
    "    operating_income: Optional[float] = Field(\n",
    "        None, description=\"Segment operating income\"\n",
    "    )\n",
    "    operating_margin: Optional[float] = Field(\n",
    "        None, description=\"Segment operating margin percentage\"\n",
    "    )\n",
    "    year_over_year_growth: float = Field(\n",
    "        None, description=\"Year-over-year growth percentage\"\n",
    "    )\n",
    "    outlook: Optional[str] = Field(None, description=\"Future outlook for segment\")\n",
    "\n",
    "\n",
    "class GeographicSegment(BaseModel):\n",
    "    \"\"\"Information about a geographic segment\"\"\"\n",
    "\n",
    "    region: str = Field(description=\"Geographic region\")\n",
    "    revenue: float = Field(None, description=\"Revenue from region\")\n",
    "    revenue_percentage: Optional[float] = Field(\n",
    "        None, description=\"Percentage of total company revenue\"\n",
    "    )\n",
    "    year_over_year_growth: Optional[float] = Field(\n",
    "        None, description=\"Year-over-year growth percentage\"\n",
    "    )\n",
    "\n",
    "\n",
    "class RiskFactor(BaseModel):\n",
    "    \"\"\"Information about a risk factor\"\"\"\n",
    "\n",
    "    category: str = Field(\n",
    "        description=\"Risk category (e.g., Market, Operational, Legal)\"\n",
    "    )\n",
    "    title: Optional[str] = Field(None, description=\"Brief title of risk\")\n",
    "    description: str = Field(description=\"Description of risk factor\")\n",
    "    potential_impact: Optional[str] = Field(\n",
    "        None, description=\"Potential business impact\"\n",
    "    )\n",
    "\n",
    "\n",
    "class ManagementHighlights(BaseModel):\n",
    "    \"\"\"Key highlights from Management Discussion & Analysis\"\"\"\n",
    "\n",
    "    business_overview: str = Field(description=\"Overview of business and strategy\")\n",
    "    key_trends: Optional[str] = Field(\n",
    "        None, description=\"Key trends affecting performance\"\n",
    "    )\n",
    "    liquidity_assessment: Optional[str] = Field(\n",
    "        None, description=\"Management assessment of liquidity\"\n",
    "    )\n",
    "    outlook_summary: str = Field(description=\"Future outlook/guidance\")\n",
    "\n",
    "\n",
    "class SECFiling(BaseModel):\n",
    "    \"\"\"Schema for parsing 10-K and 10-Q filings from the SEC\"\"\"\n",
    "\n",
    "    filing_info: FilingInfo = Field(description=\"Basic information about the filing\")\n",
    "    company_profile: CompanyProfile = Field(description=\"Essential company information\")\n",
    "    financial_highlights: FinancialHighlights = Field(\n",
    "        description=\"Key financial metrics from this reporting period\"\n",
    "    )\n",
    "    business_segments: Optional[List[BusinessSegment]] = Field(\n",
    "        None, description=\"Key business segments information\"\n",
    "    )\n",
    "    geographic_segments: Optional[List[GeographicSegment]] = Field(\n",
    "        None, description=\"Geographic segment information\"\n",
    "    )\n",
    "    key_risks: List[RiskFactor] = Field(description=\"Most significant risk factors\")\n",
    "    mda_highlights: ManagementHighlights = Field(\n",
    "        description=\"Key highlights from Management Discussion & Analysis\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a0498e9",
   "metadata": {},
   "source": [
    "### 2. Extracting Information from $NVDA 10-K Filing\n",
    "\n",
    "Take a look at the schema definition above. We've defined a few models to represent the different sections of the 10K/Q filing. \n",
    "We've also defined a `SECFiling` model that combines all the sections into a single model. \n",
    "\n",
    "\n",
    "#### Design Considerations for Schema Definition\n",
    "\n",
    "- **Optional Fields**: There are quite a few optional fields in the schema. There are many fields that we would like to extract if present, but we know that they are not present in all filings. \n",
    "  e.g. companies which only has a US footprint will not have a geographic breakdown of their financials. It is important to designate these fields as optional so that the LLM is not \n",
    "  forced to make up values for these fields. Designating these fields as optional helps provide an escape hatch for the LLM to not hallucinate values for these fields. Note, however, that if aggressively marking fields as optional might result in the LLM being overly lazy and not attempt to extract information for these fields. So there's a balance in what fields to mark as optional! \n",
    "- **Descriptions for Fields**: While not mandatory, it is always a good idea to provide a description for each field. This helps the LLM understand the context in which the field is being extracted and can improve the accuracy of the extraction.  \n",
    "- **Enums**: We use enums to limit the possible values for a field. e.g. the `FilingInfo` model has an enum for the possible values of `filing_type`.  \n",
    "\n",
    "Now, let us create an agent to extract this information from the 10K/Q filing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d335b32",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_cloud.core.api_error import ApiError\n",
    "\n",
    "try:\n",
    "    existing_agent = llama_extract.get_agent(name=\"sec-10k-filing\")\n",
    "    if existing_agent:\n",
    "        llama_extract.delete_agent(existing_agent.id)\n",
    "except ApiError as e:\n",
    "    if e.status_code == 404:\n",
    "        pass\n",
    "    else:\n",
    "        raise\n",
    "\n",
    "agent = llama_extract.create_agent(name=\"sec-10k-filing\", data_schema=SECFiling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "532f6ff5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Uploading files: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it]\n",
      "Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00,  2.78it/s]\n",
      "Extracting files: 100%|██████████| 1/1 [01:31<00:00, 91.56s/it]\n",
      "Uploading files: 100%|██████████| 1/1 [00:01<00:00,  1.26s/it]\n",
      "Creating extraction jobs: 100%|██████████| 1/1 [00:01<00:00,  1.44s/it]\n",
      "Extracting files: 100%|██████████| 1/1 [01:32<00:00, 92.73s/it]\n",
      "Uploading files: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it]\n",
      "Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00,  2.85it/s]\n",
      "Extracting files: 100%|██████████| 1/1 [00:51<00:00, 51.87s/it]\n"
     ]
    }
   ],
   "source": [
    "nvda_10k_extract = agent.extract(\"./data/sec_filings/nvda_10k.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83009725",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'filing_info': {'filing_type': '10-K',\n",
       "  'filing_date': '',\n",
       "  'reporting_period_end': '2025-01-26',\n",
       "  'fiscal_year': 2025,\n",
       "  'fiscal_quarter': 0},\n",
       " 'company_profile': {'name': 'NVIDIA Corporation',\n",
       "  'ticker': 'NVDA',\n",
       "  'reporting_currency': 'USD',\n",
       "  'exchanges': ['The Nasdaq Global Select Market'],\n",
       "  'auditor': None},\n",
       " 'financial_highlights': {'period_end': '2025-01-26',\n",
       "  'comparison_period_end': '2024-01-28',\n",
       "  'currency': 'USD',\n",
       "  'unit': 'thousands',\n",
       "  'revenue': 68038.0,\n",
       "  'revenue_prior_period': 26974.0,\n",
       "  'revenue_growth': 0.0,\n",
       "  'gross_profit': None,\n",
       "  'gross_margin': 75.0,\n",
       "  'operating_income': None,\n",
       "  'operating_margin': None,\n",
       "  'net_income': 72880.0,\n",
       "  'net_margin': None,\n",
       "  'eps': None,\n",
       "  'diluted_eps': None,\n",
       "  'ebitda': None,\n",
       "  'free_cash_flow': None},\n",
       " 'business_segments': [{'name': 'Compute & Networking',\n",
       "   'description': 'Strong demand for our accelerated computing and AI solutions. Revenue from Data Center computing grew 162% driven primarily by demand for our Hopper computing platform used for large language models, recommendation engines, and generative AI applications. Revenue from Data Center networking grew 51% driven by Ethernet for AI revenue, which includes Spectrum-X end-to-end ethernet platform.',\n",
       "   'revenue': 116193.0,\n",
       "   'revenue_percentage': 89.05,\n",
       "   'operating_income': 82875.0,\n",
       "   'operating_margin': 71.33,\n",
       "   'year_over_year_growth': 145.0,\n",
       "   'outlook': None},\n",
       "  {'name': 'Graphics',\n",
       "   'description': 'The year over year increase was driven by sales of our GeForce RTX 40 Series GPUs.',\n",
       "   'revenue': 14304.0,\n",
       "   'revenue_percentage': 10.95,\n",
       "   'operating_income': 5085.0,\n",
       "   'operating_margin': 35.55,\n",
       "   'year_over_year_growth': 6.0,\n",
       "   'outlook': None},\n",
       "  {'name': 'Data Center',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 115186.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'Compute',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 102196.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'Networking',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 12990.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'Gaming',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 11350.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'Professional Visualization',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 1878.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'Automotive',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 1694.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None},\n",
       "  {'name': 'OEM and Other',\n",
       "   'description': 'Revenue by End Market',\n",
       "   'revenue': 389.0,\n",
       "   'revenue_percentage': None,\n",
       "   'operating_income': None,\n",
       "   'operating_margin': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'outlook': None}],\n",
       " 'geographic_segments': [{'region': 'Outside of the United States',\n",
       "   'revenue': None,\n",
       "   'revenue_percentage': 53.0,\n",
       "   'year_over_year_growth': -3.0},\n",
       "  {'region': 'United States',\n",
       "   'revenue': 61257.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None},\n",
       "  {'region': 'Singapore',\n",
       "   'revenue': 23684.0,\n",
       "   'revenue_percentage': 18.0,\n",
       "   'year_over_year_growth': None},\n",
       "  {'region': 'Taiwan',\n",
       "   'revenue': 20573.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None},\n",
       "  {'region': 'China (including Hong Kong)',\n",
       "   'revenue': 17108.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None},\n",
       "  {'region': 'Other',\n",
       "   'revenue': 7875.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None}],\n",
       " 'key_risks': [{'category': 'Operational',\n",
       "   'title': 'Supply-chain attacks or other business disruptions',\n",
       "   'description': 'We cannot guarantee that third parties and infrastructure in our supply chain or our partners’ supply chains have not been compromised or that they do not contain exploitable vulnerabilities, defects or bugs that could result in a breach of or disruption to our information technology systems, including our products and services, or the third-party information technology systems that support our services.',\n",
       "   'potential_impact': \"Potential reputational damage, regulatory scrutiny, or adverse impacts on the performance and reliability of our products, which could, in turn, affect our partners' operations, customer trust, and our revenue.\"},\n",
       "  {'category': 'Operational',\n",
       "   'title': \"Limited insight into third-party suppliers' data privacy or security practices\",\n",
       "   'description': 'Our ability to monitor these third parties’ information security practices is limited, and they may not have adequate information security measures in place.',\n",
       "   'potential_impact': 'If one of our third-party suppliers suffers a security incident, our response may be limited or more difficult because we may not have direct access to their systems, logs and other information related to the security incident.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Business disruptions',\n",
       "   'description': 'Business disruptions could harm our operations, lead to a decline in revenue and increase our costs. Factors that have caused and/or could in the future cause disruptions to our worldwide operations include: natural disasters, extreme weather conditions, power or water shortages, critical infrastructure failures, telecommunications failures, supplier disruptions, terrorist attacks, acts of violence, political and/or civil unrest, acts of war or other military actions, epidemics or pandemics, abrupt regulatory changes, and other natural or man-made disasters and catastrophic events.',\n",
       "   'potential_impact': 'Our operations vulnerable to natural disasters such as earthquakes, wildfires or other business disruptions occurring in these geographical areas. Catastrophic events can also have an impact on third-party vendors who provide us critical infrastructure services for IT and research and development systems and personnel. Geopolitical and domestic political developments and other events beyond our control can increase economic volatility globally. Political instability, changes in government or adverse political developments in or around any of the major countries in which we do business may harm our business, financial condition and results of operations.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Complex laws, rules, regulations, and political actions',\n",
       "   'description': 'We are subject to laws and regulations domestically and worldwide, affecting our operations in areas including, but not limited to, IP ownership and infringement; taxes; import and export requirements and tariffs; anti-corruption, including the Foreign Corrupt Practices Act; business acquisitions; foreign exchange controls and cash repatriation restrictions; foreign ownership and investment; data privacy requirements; competition and antitrust; advertising; employment; product regulations; cybersecurity; environmental, health, and safety requirements; the responsible use of AI; sustainability; cryptocurrency; and consumer laws.',\n",
       "   'potential_impact': 'Compliance with such requirements can be onerous and expensive, could impact our competitive position, and may negatively impact our business operations and ability to manufacture and ship our products. Violations could result in fines, criminal sanctions against us, our officers, or our employees, prohibitions on the conduct of our business, and damage to our reputation.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Export controls and geopolitical tensions',\n",
       "   'description': 'The USG announced export restrictions and export licensing requirements targeting China’s semiconductor and supercomputing industries. These restrictions impact exports of certain chips, as well as software, hardware, equipment and technology used to develop, produce and manufacture certain chips to China (including Hong Kong and Macau) and Russia, and specifically impact our A100 and H100 integrated circuits, DGX or any other systems or boards which incorporate A100 or H100 integrated circuits.',\n",
       "   'potential_impact': 'Such restrictions could increase the costs and burdens to us and our customers, delay or halt deployment of new systems using our products, and reduce the number of new entrants and customers, negatively impacting our business and financial results. Revisions to laws or regulations or their interpretation and enforcement could also result in increased taxation, trade sanctions, the imposition of or increase to import duties or tariffs, restrictions and controls on imports or exports, or other retaliatory actions, which could have an adverse effect on our business plans or impact the timing of our shipments.'},\n",
       "  {'category': 'Environmental',\n",
       "   'title': 'Climate change',\n",
       "   'description': 'Climate change may have an increasingly adverse impact on our business and on our customers, partners and vendors. Water and energy availability and reliability in the regions where we conduct business is critical, and certain of our facilities may be vulnerable to the impacts of extreme weather events.',\n",
       "   'potential_impact': 'Climate change, its impact on our supply chain and critical infrastructure worldwide and its potential to increase political instability in regions where we, our customers, partners and our vendors do business, may disrupt our business and cause us to experience higher attrition, losses and costs to maintain or resume operations. Losses not covered by insurance may be large, which could harm our results of operations and financial condition.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Chinese government restrictions',\n",
       "   'description': 'Restrictions imposed by the Chinese government on the duration of gaming activities and access to games may adversely affect our Gaming revenue, and increased oversight of digital platform companies may adversely affect our Data Center revenue. The Chinese government may also encourage customers to purchase from our China-based competitors, or impose restrictions on the sale to certain customers of our products, or any products containing components made by our partners and suppliers.',\n",
       "   'potential_impact': 'Negatively impact our business and financial results.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Supply chain disruptions',\n",
       "   'description': 'Our business depends on our ability to receive consistent and reliable supply from our overseas partners, especially in Taiwan and South Korea. Any new restrictions that negatively impact our ability to receive supply of components, parts, or services from Taiwan and South Korea, would negatively impact our business and financial results.',\n",
       "   'potential_impact': 'Negatively impact our business and financial results.'},\n",
       "  {'category': 'Reputational',\n",
       "   'title': 'Corporate sustainability practices scrutiny',\n",
       "   'description': 'Increased scrutiny from shareholders, regulators and others regarding our corporate sustainability practices could result in additional costs or risks and adversely impact our reputation and willingness of customers and suppliers to do business with us.',\n",
       "   'potential_impact': 'Negatively harm our brand, reputation and business activities or expose us to liability.'},\n",
       "  {'category': 'Reputational',\n",
       "   'title': 'Responsible use of AI technologies',\n",
       "   'description': 'Issues relating to the responsible use of our technologies, including AI in our offerings, may result in reputational or financial harm and liability. Concerns relating to the responsible use of new and evolving technologies, such as AI, in our products and services may result in reputational or financial harm and liability and may cause us to incur costs to resolve such issues.',\n",
       "   'potential_impact': 'Brand or reputational harm, competitive harm or legal liability.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Intellectual property rights protection',\n",
       "   'description': 'Actions to adequately protect our IP rights could result in substantial costs to us and our ability to compete could be harmed if we are unsuccessful or if we are prohibited from making or selling our products.',\n",
       "   'potential_impact': 'Increase our operating expenses and negatively impact our operating results.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Data privacy and security laws',\n",
       "   'description': 'We are subject to stringent and changing data privacy and security laws, rules, regulations and other obligations. These areas could damage our reputation, deter current and potential customers, affect our product design, or result in legal or regulatory proceedings and liability.',\n",
       "   'potential_impact': 'Material adverse effect on our reputation, business, or financial condition.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Tax liabilities and changes in tax laws',\n",
       "   'description': 'We may have exposure to additional tax liabilities and our operating results may be adversely impacted by changes in tax laws, higher than expected tax rates and other tax-related factors.',\n",
       "   'potential_impact': 'Adversely affect our provision for income taxes, cash tax payments, results of operations, and financial condition.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Litigation, investigations and regulatory proceedings',\n",
       "   'description': 'Our business is exposed to the burden and risks associated with litigation, investigations and regulatory proceedings.',\n",
       "   'potential_impact': 'Costly, time-consuming, and disruptive to our operations.'}],\n",
       " 'mda_highlights': {'business_overview': 'NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. NVIDIA is now a full-stack computing infrastructure company with data-center-scale offerings that are reshaping industry. Our contracts may contain more than one performance obligation. Judgement is required in determining whether each performance obligation within a customer contract is distinct. Except for License and Development Arrangements, NVIDIA products and services function on a standalone basis and do not require a significant amount of integration or interdependency. Therefore, multiple performance obligations contained within a customer contract are considered distinct and are not combined for revenue recognition purposes.',\n",
       "  'key_trends': None,\n",
       "  'liquidity_assessment': None,\n",
       "  'outlook_summary': 'We believe that we have sufficient liquidity to meet our operating requirements for at least the next twelve months and thereafter for the foreseeable future, including our future supply obligations and share purchases. We continuously evaluate our liquidity and capital resources, including our access to external capital, to ensure we can finance future capital requirements.'}}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nvda_10k_extract.data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b299a3e5",
   "metadata": {},
   "source": [
    "### 3. Assessing the Extraction Results\n",
    "\n",
    "Let's take a look at the extraction results for Nvidia's 10-K filing. The descriptions for the management highlights and key risks look reasonable at first glance. The financial metrics are harder to verify, since this is a long document with many pages.\n",
    "\n",
    "#### Adding Page Numbers to the Extraction Schema\n",
    "\n",
    "One way to make the extraction results easier to verify is to add page numbers to the extraction schema, so that we can see which pages contain the key financial information. Let us add a `page_numbers` sub-field to the `FinancialHighlights`, `BusinessSegment`, and `GeographicSegment` models so that the key financial metrics can be traced back to their source pages.\n",
    "\n",
    "> **Note**: Page numbers might be off by one due to the relative placement of the page numbers and the surrounding context from which the information is extracted, but they are a quick way to navigate to the relevant sections of the document and sanity-check some fields.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d20eca24",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'$defs': {'BusinessSegment': {'description': 'Information about a business segment',\n",
       "   'properties': {'name': {'description': 'Segment name',\n",
       "     'title': 'Name',\n",
       "     'type': 'string'},\n",
       "    'description': {'description': 'Segment description',\n",
       "     'title': 'Description',\n",
       "     'type': 'string'},\n",
       "    'revenue': {'default': None,\n",
       "     'description': 'Segment revenue',\n",
       "     'title': 'Revenue',\n",
       "     'type': 'number'},\n",
       "    'revenue_percentage': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Percentage of total company revenue',\n",
       "     'title': 'Revenue Percentage'},\n",
       "    'operating_income': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Segment operating income',\n",
       "     'title': 'Operating Income'},\n",
       "    'operating_margin': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Segment operating margin percentage',\n",
       "     'title': 'Operating Margin'},\n",
       "    'year_over_year_growth': {'default': None,\n",
       "     'description': 'Year-over-year growth percentage',\n",
       "     'title': 'Year Over Year Growth',\n",
       "     'type': 'number'},\n",
       "    'outlook': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Future outlook for segment',\n",
       "     'title': 'Outlook'},\n",
       "    'page_numbers': {'description': 'Page numbers (at bottom of the page) where the financial metrics above are extracted from.',\n",
       "     'items': {'type': 'integer'},\n",
       "     'title': 'Page Numbers',\n",
       "     'type': 'array'}},\n",
       "   'required': ['name', 'description', 'page_numbers'],\n",
       "   'title': 'BusinessSegment',\n",
       "   'type': 'object'},\n",
       "  'CompanyProfile': {'description': 'Essential company information',\n",
       "   'properties': {'name': {'description': 'Legal name of company',\n",
       "     'title': 'Name',\n",
       "     'type': 'string'},\n",
       "    'ticker': {'description': 'Stock ticker symbol',\n",
       "     'title': 'Ticker',\n",
       "     'type': 'string'},\n",
       "    'reporting_currency': {'description': 'Currency used in financial statements',\n",
       "     'title': 'Reporting Currency',\n",
       "     'type': 'string'},\n",
       "    'exchanges': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'},\n",
       "      {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Stock exchanges where listed',\n",
       "     'title': 'Exchanges'},\n",
       "    'auditor': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': \"Company's auditor\",\n",
       "     'title': 'Auditor'}},\n",
       "   'required': ['name', 'ticker', 'reporting_currency'],\n",
       "   'title': 'CompanyProfile',\n",
       "   'type': 'object'},\n",
       "  'FilingInfo': {'description': 'Basic information about the SEC filing',\n",
       "   'properties': {'filing_type': {'description': 'Type of SEC filing',\n",
       "     'enum': ['10-K', '10-Q', '10-K/A', '10-Q/A'],\n",
       "     'title': 'Filing Type',\n",
       "     'type': 'string'},\n",
       "    'filing_date': {'description': 'Date when filing was submitted to SEC',\n",
       "     'title': 'Filing Date',\n",
       "     'type': 'string'},\n",
       "    'reporting_period_end': {'description': 'End date of reporting period',\n",
       "     'title': 'Reporting Period End',\n",
       "     'type': 'string'},\n",
       "    'fiscal_year': {'description': 'Fiscal year',\n",
       "     'title': 'Fiscal Year',\n",
       "     'type': 'integer'},\n",
       "    'fiscal_quarter': {'description': 'Fiscal quarter (if 10-Q)',\n",
       "     'maximum': 4,\n",
       "     'minimum': 1,\n",
       "     'title': 'Fiscal Quarter',\n",
       "     'type': 'integer'}},\n",
       "   'required': ['filing_type',\n",
       "    'filing_date',\n",
       "    'reporting_period_end',\n",
       "    'fiscal_year',\n",
       "    'fiscal_quarter'],\n",
       "   'title': 'FilingInfo',\n",
       "   'type': 'object'},\n",
       "  'FinancialHighlights': {'description': 'Key financial metrics from this reporting period',\n",
       "   'properties': {'period_end': {'description': 'End date of reporting period',\n",
       "     'title': 'Period End',\n",
       "     'type': 'string'},\n",
       "    'comparison_period_end': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'End date of comparison period (typically prior year/quarter)',\n",
       "     'title': 'Comparison Period End'},\n",
       "    'currency': {'description': 'Currency of financial figures',\n",
       "     'title': 'Currency',\n",
       "     'type': 'string'},\n",
       "    'unit': {'description': 'Unit of financial figures (thousands, millions, etc.)',\n",
       "     'title': 'Unit',\n",
       "     'type': 'string'},\n",
       "    'revenue': {'description': 'Total revenue for period',\n",
       "     'title': 'Revenue',\n",
       "     'type': 'number'},\n",
       "    'revenue_prior_period': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Revenue from comparison period',\n",
       "     'title': 'Revenue Prior Period'},\n",
       "    'revenue_growth': {'description': 'Revenue growth percentage',\n",
       "     'title': 'Revenue Growth',\n",
       "     'type': 'number'},\n",
       "    'gross_profit': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Gross profit',\n",
       "     'title': 'Gross Profit'},\n",
       "    'gross_margin': {'description': 'Gross margin percentage',\n",
       "     'title': 'Gross Margin',\n",
       "     'type': 'number'},\n",
       "    'operating_income': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Operating income',\n",
       "     'title': 'Operating Income'},\n",
       "    'operating_margin': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Operating margin percentage',\n",
       "     'title': 'Operating Margin'},\n",
       "    'net_income': {'description': 'Net income',\n",
       "     'title': 'Net Income',\n",
       "     'type': 'number'},\n",
       "    'net_margin': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Net margin percentage',\n",
       "     'title': 'Net Margin'},\n",
       "    'eps': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Basic earnings per share',\n",
       "     'title': 'Eps'},\n",
       "    'diluted_eps': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Diluted earnings per share',\n",
       "     'title': 'Diluted Eps'},\n",
       "    'ebitda': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'EBITDA (Earnings Before Interest, Taxes, Depreciation, Amortization)',\n",
       "     'title': 'Ebitda'},\n",
       "    'free_cash_flow': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Free cash flow',\n",
       "     'title': 'Free Cash Flow'},\n",
       "    'page_numbers': {'description': 'Page numbers (at bottom of the page) where the financial metrics above are extracted from.',\n",
       "     'items': {'type': 'integer'},\n",
       "     'title': 'Page Numbers',\n",
       "     'type': 'array'}},\n",
       "   'required': ['period_end',\n",
       "    'currency',\n",
       "    'unit',\n",
       "    'revenue',\n",
       "    'revenue_growth',\n",
       "    'gross_margin',\n",
       "    'net_income',\n",
       "    'page_numbers'],\n",
       "   'title': 'FinancialHighlights',\n",
       "   'type': 'object'},\n",
       "  'GeographicSegment': {'description': 'Information about a geographic segment',\n",
       "   'properties': {'region': {'description': 'Geographic region',\n",
       "     'title': 'Region',\n",
       "     'type': 'string'},\n",
       "    'revenue': {'default': None,\n",
       "     'description': 'Revenue from region',\n",
       "     'title': 'Revenue',\n",
       "     'type': 'number'},\n",
       "    'revenue_percentage': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Percentage of total company revenue',\n",
       "     'title': 'Revenue Percentage'},\n",
       "    'year_over_year_growth': {'anyOf': [{'type': 'number'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Year-over-year growth percentage',\n",
       "     'title': 'Year Over Year Growth'},\n",
       "    'page_numbers': {'description': 'Page numbers (at bottom of the page) where the financial metrics above are extracted from.',\n",
       "     'items': {'type': 'integer'},\n",
       "     'title': 'Page Numbers',\n",
       "     'type': 'array'}},\n",
       "   'required': ['region', 'page_numbers'],\n",
       "   'title': 'GeographicSegment',\n",
       "   'type': 'object'},\n",
       "  'ManagementHighlights': {'description': 'Key highlights from Management Discussion & Analysis',\n",
       "   'properties': {'business_overview': {'description': 'Overview of business and strategy',\n",
       "     'title': 'Business Overview',\n",
       "     'type': 'string'},\n",
       "    'key_trends': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Key trends affecting performance',\n",
       "     'title': 'Key Trends'},\n",
       "    'liquidity_assessment': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Management assessment of liquidity',\n",
       "     'title': 'Liquidity Assessment'},\n",
       "    'outlook_summary': {'description': 'Future outlook/guidance',\n",
       "     'title': 'Outlook Summary',\n",
       "     'type': 'string'}},\n",
       "   'required': ['business_overview', 'outlook_summary'],\n",
       "   'title': 'ManagementHighlights',\n",
       "   'type': 'object'},\n",
       "  'RiskFactor': {'description': 'Information about a risk factor',\n",
       "   'properties': {'category': {'description': 'Risk category (e.g., Market, Operational, Legal)',\n",
       "     'title': 'Category',\n",
       "     'type': 'string'},\n",
       "    'title': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Brief title of risk',\n",
       "     'title': 'Title'},\n",
       "    'description': {'description': 'Description of risk factor',\n",
       "     'title': 'Description',\n",
       "     'type': 'string'},\n",
       "    'potential_impact': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "     'default': None,\n",
       "     'description': 'Potential business impact',\n",
       "     'title': 'Potential Impact'}},\n",
       "   'required': ['category', 'description'],\n",
       "   'title': 'RiskFactor',\n",
       "   'type': 'object'}},\n",
       " 'description': 'Schema for parsing 10-K and 10-Q filings from the SEC',\n",
       " 'properties': {'filing_info': {'$ref': '#/$defs/FilingInfo',\n",
       "   'description': 'Basic information about the filing'},\n",
       "  'company_profile': {'$ref': '#/$defs/CompanyProfile'},\n",
       "  'financial_highlights': {'$ref': '#/$defs/FinancialHighlights'},\n",
       "  'business_segments': {'anyOf': [{'items': {'$ref': '#/$defs/BusinessSegment'},\n",
       "     'type': 'array'},\n",
       "    {'type': 'null'}],\n",
       "   'default': None,\n",
       "   'description': 'Key business segments information',\n",
       "   'title': 'Business Segments'},\n",
       "  'geographic_segments': {'anyOf': [{'items': {'$ref': '#/$defs/GeographicSegment'},\n",
       "     'type': 'array'},\n",
       "    {'type': 'null'}],\n",
       "   'default': None,\n",
       "   'description': 'Geographic segment information',\n",
       "   'title': 'Geographic Segments'},\n",
       "  'key_risks': {'description': 'Most significant risk factors',\n",
       "   'items': {'$ref': '#/$defs/RiskFactor'},\n",
       "   'title': 'Key Risks',\n",
       "   'type': 'array'},\n",
       "  'mda_highlights': {'$ref': '#/$defs/ManagementHighlights'}},\n",
       " 'required': ['filing_info',\n",
       "  'company_profile',\n",
       "  'financial_highlights',\n",
       "  'key_risks',\n",
       "  'mda_highlights'],\n",
       " 'title': 'SECFiling',\n",
       " 'type': 'object'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import List\n",
    "\n",
    "from pydantic.fields import FieldInfo\n",
    "\n",
    "# Patch a required `page_numbers` field onto each model that carries financial\n",
    "# metrics, then force a rebuild so the model's schema reflects the new field.\n",
    "FinancialHighlights.__annotations__[\"page_numbers\"] = List[int]\n",
    "FinancialHighlights.model_fields[\"page_numbers\"] = FieldInfo(\n",
    "    annotation=List[int],\n",
    "    description=\"Page numbers (at bottom of the page) where the financial metrics above are extracted from.\",\n",
    ")\n",
    "FinancialHighlights.model_rebuild(force=True)\n",
    "\n",
    "BusinessSegment.model_fields[\"page_numbers\"] = FieldInfo(\n",
    "    annotation=List[int],\n",
    "    description=\"Page numbers (at bottom of the page) where the financial metrics above are extracted from.\",\n",
    ")\n",
    "BusinessSegment.model_rebuild(force=True)\n",
    "\n",
    "GeographicSegment.model_fields[\"page_numbers\"] = FieldInfo(\n",
    "    annotation=List[int],\n",
    "    description=\"Page numbers (at bottom of the page) where the financial metrics above are extracted from.\",\n",
    ")\n",
    "GeographicSegment.model_rebuild(force=True)\n",
    "\n",
    "# Rebuild the parent model so it picks up the updated nested models,\n",
    "# then display the resulting JSON schema.\n",
    "SECFiling.model_rebuild(force=True)\n",
    "SECFiling.model_json_schema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fcd3adb",
   "metadata": {},
   "outputs": [],
   "source": [
    "agent.data_schema = SECFiling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ccd0a19",
   "metadata": {},
   "outputs": [],
   "source": [
    "nvda_10k_extract = agent.extract(\"./data/sec_filings/nvda_10k.pdf\")"
   ]
  },
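  {
   "cell_type": "markdown",
   "id": "e7c2b9a4",
   "metadata": {},
   "source": [
    "Page numbers help locate a figure in the document. The percentage fields can also be cross-checked against the raw figures they are derived from. The following is a minimal sketch and not part of the extraction library: `pct`, `check_derived_metrics`, and the `tol` parameter are hypothetical names, the sample values are copied from the `financial_highlights` block of the Nvidia extraction in this notebook, and the 0.5 percentage-point tolerance is an arbitrary assumption.\n",
    "\n",
    "```python\n",
    "def pct(numerator, denominator):\n",
    "    # Return numerator / denominator as a percentage, or None if not computable.\n",
    "    if numerator is None or not denominator:\n",
    "        return None\n",
    "    return numerator / denominator * 100\n",
    "\n",
    "\n",
    "def check_derived_metrics(fin, tol=0.5):\n",
    "    # Compare reported percentage fields with values recomputed from raw figures.\n",
    "    # Returns {field: (reported, recomputed, within_tolerance)} for every field\n",
    "    # where both values are available. `tol` is in percentage points (assumed).\n",
    "    revenue = fin.get('revenue')\n",
    "    prior = fin.get('revenue_prior_period')\n",
    "    change = None if revenue is None or prior is None else revenue - prior\n",
    "    recomputed = {\n",
    "        'revenue_growth': pct(change, prior),\n",
    "        'gross_margin': pct(fin.get('gross_profit'), revenue),\n",
    "        'operating_margin': pct(fin.get('operating_income'), revenue),\n",
    "        'net_margin': pct(fin.get('net_income'), revenue),\n",
    "    }\n",
    "    report = {}\n",
    "    for field, value in recomputed.items():\n",
    "        reported = fin.get(field)\n",
    "        if reported is None or value is None:\n",
    "            continue  # nothing to compare against\n",
    "        report[field] = (reported, round(value, 2), abs(reported - value) <= tol)\n",
    "    return report\n",
    "\n",
    "\n",
    "# Values copied from the extraction output for this filing.\n",
    "sample = {\n",
    "    'revenue': 130497.0,\n",
    "    'revenue_prior_period': 60922.0,\n",
    "    'revenue_growth': 114.23,\n",
    "    'gross_profit': 97858.0,\n",
    "    'gross_margin': 75.0,\n",
    "    'operating_income': 81453.0,\n",
    "    'operating_margin': None,\n",
    "    'net_income': 72880.0,\n",
    "    'net_margin': 55.8,\n",
    "}\n",
    "\n",
    "for field, (reported, recomputed, ok) in check_derived_metrics(sample).items():\n",
    "    print(f'{field}: reported={reported} recomputed={recomputed} ok={ok}')\n",
    "```\n",
    "\n",
    "In the notebook, `nvda_10k_extract.data['financial_highlights']` could be passed in place of `sample`, assuming `.data` is a plain dict as its printed form suggests. Any field that falls outside the tolerance is a good candidate for a manual look at the pages listed in `page_numbers`.\n"
   ]
  },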
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a733774b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'filing_info': {'filing_type': '10-K',\n",
       "  'filing_date': '2025-01-26',\n",
       "  'reporting_period_end': '2025-01-26',\n",
       "  'fiscal_year': 2025,\n",
       "  'fiscal_quarter': 1},\n",
       " 'company_profile': {'name': 'NVIDIA Corporation',\n",
       "  'ticker': 'NVDA',\n",
       "  'reporting_currency': 'USD',\n",
       "  'exchanges': ['The Nasdaq Global Select Market'],\n",
       "  'auditor': None},\n",
       " 'financial_highlights': {'period_end': '2025-01-26',\n",
       "  'comparison_period_end': '2024-01-28',\n",
       "  'currency': 'USD',\n",
       "  'unit': 'thousands',\n",
       "  'revenue': 130497.0,\n",
       "  'revenue_prior_period': 60922.0,\n",
       "  'revenue_growth': 114.23,\n",
       "  'gross_profit': 97858.0,\n",
       "  'gross_margin': 75.0,\n",
       "  'operating_income': 81453.0,\n",
       "  'operating_margin': None,\n",
       "  'net_income': 72880.0,\n",
       "  'net_margin': 55.8,\n",
       "  'eps': None,\n",
       "  'diluted_eps': None,\n",
       "  'ebitda': None,\n",
       "  'free_cash_flow': None,\n",
       "  'page_numbers': [40, 41, 55, 56, 68]},\n",
       " 'business_segments': [{'name': 'Compute & Networking',\n",
       "   'description': 'Includes Data Center accelerated computing platforms and AI solutions and software; networking; automotive platforms and autonomous and electric vehicle solutions; Jetson for robotics and other embedded platforms; and DGX Cloud computing services. Strong demand for accelerated computing and AI solutions. Revenue from Data Center computing grew 162% driven primarily by demand for our Hopper computing platform used for large language models, recommendation engines, and generative AI applications. Revenue from Data Center networking grew 51% driven by Ethernet for AI revenue, which includes Spectrum-X end-to-end ethernet platform. Includes product costs and inventory provisions, compensation and benefits excluding stock-based compensation expense, compute and infrastructure expenses, and engineering development costs.',\n",
       "   'revenue': 116193.0,\n",
       "   'revenue_percentage': 88.99,\n",
       "   'operating_income': 82875.0,\n",
       "   'operating_margin': 71.3,\n",
       "   'year_over_year_growth': 145.0,\n",
       "   'outlook': 'Higher U.S.-based Compute & Networking segment demand.',\n",
       "   'page_numbers': [5, 40, 68, 79]},\n",
       "  {'name': 'Graphics',\n",
       "   'description': 'Includes GeForce GPUs for gaming and PCs, the GeForce NOW game streaming service and related infrastructure, and solutions for gaming platforms; Quadro/NVIDIA RTX GPUs for enterprise workstation graphics; virtual GPU, or vGPU, software for cloud-based visual and virtual computing; automotive platforms for infotainment systems; and Omniverse Enterprise software for building and operating industrial AI and digital twin applications. The year over year increase was driven by sales of our GeForce RTX 40 Series GPUs. Includes product costs and inventory provisions, compensation and benefits excluding stock-based compensation expense, compute and infrastructure expenses, and engineering development costs.',\n",
       "   'revenue': 14304.0,\n",
       "   'revenue_percentage': 11.0,\n",
       "   'operating_income': 5085.0,\n",
       "   'operating_margin': 35.6,\n",
       "   'year_over_year_growth': 6.0,\n",
       "   'outlook': None,\n",
       "   'page_numbers': [5, 40, 68, 79]}],\n",
       " 'geographic_segments': [{'region': 'Outside of the United States',\n",
       "   'revenue': None,\n",
       "   'revenue_percentage': 53.0,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [42]},\n",
       "  {'region': 'United States',\n",
       "   'revenue': 61257.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [79]},\n",
       "  {'region': 'Singapore',\n",
       "   'revenue': 23684.0,\n",
       "   'revenue_percentage': 18.0,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [79]},\n",
       "  {'region': 'Taiwan',\n",
       "   'revenue': 20573.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [79]},\n",
       "  {'region': 'China (including Hong Kong)',\n",
       "   'revenue': 17108.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [79]},\n",
       "  {'region': 'Other',\n",
       "   'revenue': 7875.0,\n",
       "   'revenue_percentage': None,\n",
       "   'year_over_year_growth': None,\n",
       "   'page_numbers': [79]}],\n",
       " 'key_risks': [{'category': 'Regulatory, Legal, Our Stock, and Other Matters',\n",
       "   'title': 'Risks Related to Regulatory, Legal, Our Stock, and Other Matters',\n",
       "   'description': 'We are subject to complex laws, rules, regulations, and political and other actions, including restrictions on the export of our products, which may adversely impact our business.',\n",
       "   'potential_impact': None},\n",
       "  {'category': 'Regulatory, Legal',\n",
       "   'title': 'Increased scrutiny regarding our corporate sustainability practices could result in financial, reputational, or operational harm and liability.',\n",
       "   'description': 'Increased scrutiny regarding our corporate sustainability practices could result in financial, reputational, or operational harm and liability.',\n",
       "   'potential_impact': None},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Issues relating to the responsible use of our technologies, including AI',\n",
       "   'description': 'Issues relating to the responsible use of our technologies, including AI, may result in reputational or financial harm and liability.',\n",
       "   'potential_impact': None},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Supply-chain attacks or other business disruptions',\n",
       "   'description': \"We cannot guarantee that third parties and infrastructure in our supply chain or our partners’ supply chains have not been compromised or that they do not contain exploitable vulnerabilities, defects or bugs that could result in a breach of or disruption to our information technology systems, including our products and services, or the third-party information technology systems that support our services. We have incorporated third-party data into some of our AI models and used open-source datasets to train our models and may continue to do so. These datasets may be flawed, insufficient, or contain certain biased information, and may otherwise decrease resilience to security incidents that may compromise the integrity of our AI outputs, leading to potential reputational damage, regulatory scrutiny, or adverse impacts on the performance and reliability of our products, which could, in turn, affect our partners' operations, customer trust, and our revenue.\",\n",
       "   'potential_impact': \"Potential reputational damage, regulatory scrutiny, or adverse impacts on the performance and reliability of our products, which could, in turn, affect our partners' operations, customer trust, and our revenue.\"},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Limited insight into data privacy or security practices of third-party suppliers',\n",
       "   'description': 'Our ability to monitor these third parties’ information security practices is limited, and they may not have adequate information security measures in place. In addition, if one of our third-party suppliers suffers a security incident (which has happened in the past and may happen in the future), our response may be limited or more difficult because we may not have direct access to their systems, logs and other information related to the security incident.',\n",
       "   'potential_impact': 'Potential liability and harm to our business if our products or services are compromised, affecting a significant number of our customers and their data.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Business disruptions',\n",
       "   'description': 'Business disruptions could harm our operations, lead to a decline in revenue and increase our costs. Factors that have caused and/or could in the future cause disruptions to our worldwide operations include: natural disasters, extreme weather conditions, power or water shortages, critical infrastructure failures, telecommunications failures, supplier disruptions, terrorist attacks, acts of violence, political and/or civil unrest, acts of war or other military actions, epidemics or pandemics, abrupt regulatory changes, and other natural or man-made disasters and catastrophic events.',\n",
       "   'potential_impact': 'Harm to our operations, lead to a decline in revenue and increase our costs.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Geopolitical tensions and conflicts',\n",
       "   'description': 'Worldwide geopolitical tensions and conflicts, including but not limited to China, Hong Kong, Israel, Korea and Taiwan where the manufacture of our product components and final assembly of our products are concentrated may result in changing regulatory requirements, and other disruptions that could impact our operations and operating strategies, product demand, access to global markets, hiring, and profitability.',\n",
       "   'potential_impact': 'Our operations could be harmed and our costs could increase if manufacturing, logistics, or other operations are disrupted for any reason, including natural disasters, high heat events, water shortages, power shortages, information technology system failures or cyber-attacks, military actions or economic, and business, labor, environmental, public health, or political issues.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Interruptions or delays in services from CSPs, data center co-location partners, and other third parties',\n",
       "   'description': 'Interruptions or delays in services from CSPs, data center co-location partners, and other third parties on which we rely, including due to the events described above or other events such as the insolvency of these parties, could impair our ability to provide our products and services and harm our business.',\n",
       "   'potential_impact': 'Impair our ability to provide our products and services and harm our business.'},\n",
       "  {'category': 'Environmental',\n",
       "   'title': 'Climate change',\n",
       "   'description': 'Climate change may have an increasingly adverse impact on our business and on our customers, partners and vendors. Water and energy availability and reliability in the regions where we conduct business is critical, and certain of our facilities may be vulnerable to the impacts of extreme weather events.',\n",
       "   'potential_impact': 'Disrupt our business and cause us to experience higher attrition, losses and costs to maintain or resume operations. Losses not covered by insurance may be large, which could harm our results of operations and financial condition.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Export controls and geopolitical tensions',\n",
       "   'description': 'The USG announced export restrictions and export licensing requirements targeting China’s semiconductor and supercomputing industries. These restrictions impact exports of certain chips, as well as software, hardware, equipment and technology used to develop, produce and manufacture our products.',\n",
       "   'potential_impact': 'Could increase the costs and burdens to us and our customers, delay or halt deployment of new systems using our products, and reduce the number of new entrants and customers, negatively impacting our business and financial results.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Chinese government restrictions',\n",
       "   'description': 'Restrictions imposed by the Chinese government on the duration of gaming activities and access to games may adversely affect our Gaming revenue, and increased oversight of digital platform companies may adversely affect our Data Center revenue. The Chinese government may also encourage customers to purchase from our China-based competitors, or impose restrictions on the sale to certain customers of our products, or any products containing components made by our partners and suppliers.',\n",
       "   'potential_impact': 'Negatively impact our business and financial results.'},\n",
       "  {'category': 'Operational',\n",
       "   'title': 'Supply chain disruptions',\n",
       "   'description': 'Our business depends on our ability to receive consistent and reliable supply from our overseas partners, especially in Taiwan and South Korea. Any new restrictions that negatively impact our ability to receive supply of components, parts, or services from Taiwan and South Korea, would negatively impact our business and financial results.',\n",
       "   'potential_impact': 'Negatively impact our business and financial results.'},\n",
       "  {'category': 'Reputational',\n",
       "   'title': 'Corporate sustainability practices scrutiny',\n",
       "   'description': 'Increased scrutiny from shareholders, regulators and others regarding our corporate sustainability practices could result in additional costs or risks and adversely impact our reputation and willingness of customers and suppliers to do business with us.',\n",
       "   'potential_impact': 'Negatively harm our brand, reputation and business activities or expose us to liability.'},\n",
       "  {'category': 'Reputational/Legal',\n",
       "   'title': 'Responsible use of AI technologies',\n",
       "   'description': 'Issues relating to the responsible use of our technologies, including AI in our offerings, may result in reputational or financial harm and liability. Concerns relating to the responsible use of new and evolving technologies, such as AI, in our products and services may result in reputational or financial harm and liability and may cause us to incur costs to resolve such issues.',\n",
       "   'potential_impact': 'Reputational or financial harm and liability.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Intellectual property rights protection',\n",
       "   'description': 'Actions to adequately protect our IP rights could result in substantial costs to us and our ability to compete could be harmed if we are unsuccessful or if we are prohibited from making or selling our products.',\n",
       "   'potential_impact': 'Our business could be negatively impacted.'},\n",
       "  {'category': 'Regulatory',\n",
       "   'title': 'Data privacy and security laws',\n",
       "   'description': 'We are subject to stringent and changing data privacy and security laws, rules, regulations and other obligations. These areas could damage our reputation, deter current and potential customers, affect our product design, or result in legal or regulatory proceedings and liability.',\n",
       "   'potential_impact': 'Material adverse effect on our reputation, business, or financial condition.'},\n",
       "  {'category': 'Legal/Financial',\n",
       "   'title': 'Tax liabilities and changes in tax laws',\n",
       "   'description': 'We may have exposure to additional tax liabilities and our operating results may be adversely impacted by changes in tax laws, higher than expected tax rates and other tax-related factors.',\n",
       "   'potential_impact': 'Adversely affect our provision for income taxes, cash tax payments, results of operations, and financial condition.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Litigation, investigations and regulatory proceedings',\n",
       "   'description': 'Our business is exposed to the burden and risks associated with litigation, investigations and regulatory proceedings.',\n",
       "   'potential_impact': 'Litigation can be costly, time-consuming, and disruptive to our operations.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Securities Class Action and Derivative Lawsuits',\n",
       "   'description': 'The plaintiffs in the putative securities class action lawsuit, captioned 4:18-cv-07669-HSG, initially filed on December 21, 2018 in the United States District Court for the Northern District of California, and titled In Re NVIDIA Corporation Securities Litigation, filed an amended complaint on May 13, 2020. The amended complaint asserted that NVIDIA and certain NVIDIA executives violated Section 10(b) of the Securities Exchange Act of 1934, as amended, or the Exchange Act, and SEC Rule 10b-5, by making materially false or misleading statements related to channel inventory and the impact of cryptocurrency mining on GPU demand between May 10, 2017 and November 14, 2018. Plaintiffs also alleged that the NVIDIA executives who they named as defendants violated Section 20(a) of the Exchange Act. Plaintiffs sought class certification, an award of unspecified compensatory damages, an award of reasonable costs and expenses, including attorneys’ fees and expert fees, and further relief as the Court may deem just and proper.',\n",
       "   'potential_impact': 'Unspecified damages and other relief, including reforms and improvements to NVIDIA’s corporate governance and internal procedures.'},\n",
       "  {'category': 'Legal',\n",
       "   'title': 'Insider trading restrictions',\n",
       "   'description': 'You may be subject to insider trading restrictions and/or market abuse laws based on the exchange on which the shares of Common Stock are listed and in applicable jurisdictions, including the United States and your country or your broker’s country, if different, which may affect your ability to accept, acquire, sell or otherwise dispose of shares of Common Stock, rights to shares of Common Stock (e.g., Restricted Stock Units) or rights linked to the value of shares of Common Stock during such times as you are considered to have “inside information” regarding the Company (as defined by the laws in applicable jurisdictions). Local insider trading laws and regulations may prohibit the cancellation or amendment of orders you placed before you possessed inside information. Furthermore, you could be prohibited from (i) disclosing the inside information to any third party, which may include fellow employees and (ii) “tipping” third parties or causing them otherwise to buy or sell securities. Any restrictions under these laws or regulations are separate from and in addition to any restrictions that may be imposed under any applicable insider trading policy of the Company.',\n",
       "   'potential_impact': 'Affect your ability to accept, acquire, sell or otherwise dispose of shares of Common Stock, rights to shares of Common Stock (e.g., Restricted Stock Units) or rights linked to the value of shares of Common Stock during such times as you are considered to have “inside information” regarding the Company.'}],\n",
       " 'mda_highlights': {'business_overview': 'NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. NVIDIA is now a full-stack computing infrastructure company with data-center-scale offerings that are reshaping industry. NVIDIA invents computing technologies that improve lives and address global challenges. Our goal is to integrate sound environmental, social, and corporate governance principles and practices into every aspect of the Company. Headquartered in Santa Clara, California, NVIDIA was incorporated in California in April 1993 and reincorporated in Delaware in April 1998. We refer to customers who purchase products directly from NVIDIA as direct customers, such as AIBs, distributors, ODMs, OEMs, and system integrators. The number of Restricted Stock Units (and the related shares of Common Stock) subject to your Award will be adjusted from time to time for Capitalization Adjustments, as provided in the Plan.',\n",
       "  'key_trends': None,\n",
       "  'liquidity_assessment': 'We believe that we have sufficient liquidity to meet our operating requirements for at least the next twelve months and thereafter for the foreseeable future, including our future supply obligations and share purchases. We continuously evaluate our liquidity and capital resources, including our access to external capital, to ensure we can finance future capital requirements.',\n",
       "  'outlook_summary': 'NVIDIA has a platform strategy, bringing together hardware, systems, software, algorithms, libraries, and services to create unique value for the markets we serve. While the computing requirements of these end markets are diverse, we address them with a unified underlying architecture leveraging our GPUs and networking and software stacks. The programmable nature of our architecture allows us to support several multi-billion-dollar end markets with the same underlying technology by using a variety of software stacks developed either internally or by third-party developers and partners. The large and growing number of developers and installed base across our platforms strengthens our ecosystem and increases the value of our platform to our customers. We committed to purchase or generate enough renewable energy to match 100% of our global electricity usage for offices and data centers under our operational control starting with our fiscal year 2025. In fiscal year 2024, we made progress towards this goal and increased the percentage of our electricity use matched by renewable energy to 76%. By the end of fiscal year 2026, we also aim to engage manufacturing suppliers comprising at least 67% of NVIDIA’s scope 3 category 1 GHG emissions with the goal of effecting supplier adoption of science-based targets. As of January 26, 2025, revenue related to remaining performance obligations from contracts greater than one year in length was $1.7 billion, which includes $1.6 billion from deferred revenue and $151 million which has not yet been billed nor recognized as revenue. Approximately 39% of revenue from contracts greater than one year in length will be recognized over the next twelve months.'}}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nvda_10k_extract.data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20f643ec",
   "metadata": {},
   "source": [
    "#### Verifying Financial Metrics\n",
    "\n",
    "Now let use the page numbers to verify the accuracy of the financial metrics extracted.\n",
    "\n",
    "Here's the relevant financial metrics extracted:\n",
    "\n",
    "```python\n",
    "{\n",
    " 'financial_highlights': {'period_end': '2025-01-26',\n",
    "  'comparison_period_end': '2024-01-28',\n",
    "  'currency': 'USD',\n",
    "  'unit': 'thousands',\n",
    "  'revenue': 130497.0,\n",
    "  'revenue_prior_period': 60922.0,\n",
    "  'revenue_growth': 114.23,\n",
    "  'gross_profit': 97858.0,\n",
    "  'gross_margin': 75.0,\n",
    "  'operating_income': 81453.0,\n",
    "  'operating_margin': None,\n",
    "  'net_income': 72880.0,\n",
    "  'net_margin': 55.8,\n",
    "  'eps': None,\n",
    "  'diluted_eps': None,\n",
    "  'ebitda': None,\n",
    "  'free_cash_flow': None,\n",
    "  'page_numbers': [40, 41, 55, 56, 68]},\n",
    "}\n",
    "```\n",
    "We can see that the gross margin of 75% is extracted fro page 40. The revenue number of 130,497 is extracted from page 41 which also has the breakdown of the revenue by segment.\n",
    "\n",
    "**Page 40 (showing gross margin of 75%):**\n",
    "<img src=\"./data/sec_filings/nvda_10k_page_40.png\" width=\"50%\" alt=\"NVIDIA 10K Page 40\">\n",
    "\n",
    "**Page 41 (showing revenue of 130,497):**\n",
    "<img src=\"./data/sec_filings/nvda_10k_page_41.png\" width=\"50%\" alt=\"NVIDIA 10K Page 41\">\n",
    "\n",
    "You can likewise verify that the geographic breakdown of revenue is extracted from page 79 correctly. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af810951",
   "metadata": {},
   "source": [
    "### General Guidelines for Long-Form Document Extraction\n",
    "\n",
    "- **Schema Iteration using the Web UI**: We have a Web UI with a schema builder that can help you define your schema and iterate on different documents. We have a 10-K/Q schema for you to get started with if you are interested in trying this out. \n",
    "  Start small and build from there! Refer to the tips above. Try your schema on different documents to see whether it generalizes to the target documents.\n",
    "- **Citations**: You can ask the extraction agent to provide page numbers for key figures extracted. This will help you quickly navigate to the relevant section of the document and verify the veracity of the information extracted. \n",
    "  We will have a more robust and convenient citation feature in the future. \n",
    "- **Run scalable batch jobs**: Once you have confidence that the extraction agent is working well, you can use your agent via our [Python SDK](https://github.com/run-llama/llama_cloud_services) to run scalable batch jobs. \n",
    "\n",
    "![Web UI with the 10-K/Q Template](./data/sec_filings/web_ui.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-cloud-services",
   "language": "python",
   "name": "llama-cloud-services"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
